Detecting and Correcting Errors in Genome Assemblies
Authors
Abstract
Title of Document: DETECTING AND CORRECTING ERRORS IN GENOME ASSEMBLIES
Poorani Subramanian, Ph.D., 2010
Directed by: Professor James A. Yorke, Department of Mathematics

Genome assemblies have various types of deficiencies or misassemblies. This work is aimed at detecting and correcting a type of misassembly that we call a Compression/Expansion (CE) misassembly, in which a section of sequence has been erroneously omitted from or inserted into the assembly. Other types of deficiencies include gaps in the genome sequence. We developed a statistic for identifying Compression/Expansion misassemblies called the CE statistic. It is based on examining the placement of mate pairs of reads in the assembly. In addition, we developed an algorithm aimed at closing gaps and at validating and/or correcting the CE misassemblies detected by the CE statistic. This algorithm is similar to a shooting algorithm used in solving two-point boundary value problems in ordinary differential equations. We call this algorithm the Shooting Method. The Shooting Method finds all possible ways to assemble a local region of the genome contained between two target reads. We use a combination of the CE statistic and the Shooting Method to detect and correct some CE misassemblies and to close gaps in genome assemblies. We tested our techniques on both faux and real data. Applying this technique to 22 bacterial draft assemblies for which the finished genome sequence is known, we were able to identify 5 out of 8 real CE misassemblies. We applied the Shooting Method to a de novo assembly of the Bos taurus genome made from Sanger data. We were able to close 9,863 gaps out of 58,386. This added 8.34 Mbp of sequence to the assembly and resulted in a 7% increase in N50 contig size.
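The abstract says only that the CE statistic is based on the placement of mate pairs; it does not give the formula. The sketch below is one plausible reading, offered as an assumption rather than as the thesis's definition: a z-score comparing the mean insert size of the mate pairs spanning a position with the mean of their library. The function name `ce_statistic`, its parameters, and all numbers are hypothetical.

```python
import math
from statistics import mean


def ce_statistic(local_insert_sizes, library_mean, library_std):
    """Z-score of the local mean insert size against the library mean.

    Assumed form (not taken from the abstract):
        (local mean - library mean) / (library std / sqrt(n))
    where n is the number of mate pairs spanning the position.
    """
    n = len(local_insert_sizes)
    if n == 0:
        raise ValueError("no mate pairs span this position")
    standard_error = library_std / math.sqrt(n)
    return (mean(local_insert_sizes) - library_mean) / standard_error


# Hypothetical library: mean insert size 3000 bp, standard deviation 300 bp.
# The nine spanning mate pairs below all sit closer together in the assembly
# than the library predicts, as would happen if sequence had been omitted.
spanning = [2400, 2500, 2450, 2550, 2350, 2600, 2500, 2450, 2400]
z = ce_statistic(spanning, library_mean=3000, library_std=300)
print(round(z, 2))
```

Under this reading, a strongly negative value would suggest a compression (mates too close together) and a strongly positive value an expansion (mates too far apart). The cutoff for flagging a region is a separate choice that the abstract does not state.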
Similar resources
Detecting and correcting mis-assembled reads in contigs
De novo assemblies do not have the possibility of quality control with an external sequence. In fact, the accuracy and reliability of these assemblies are highly affected by sequencing errors and mis-assemblies. Here, a frequency-based algorithm is developed in Ruby and intended to discern assembly errors from polymorphisms/read errors and then edit or remove the misassembled read(s) to provide more ...
An approach to fault detection and correction in design of systems using of Turbo codes
We present an approach to the design of fault-tolerant computing systems. In this paper, a technique is employed that enables the combination of several codes in order to obtain flexibility in the design of error-correcting codes. Code-combining techniques are very effective, and turbo codes are one such family of codes. The Algorithm-based fault tolerance techniques that to detect errors rely on the c...
MaGuS: a tool for map-guided scaffolding and quality assessment of genome assemblies
Background: Scaffolding is a crucial step in the genome assembly process. Current methods based on large fragment paired-end reads or long reads allow an increase in continuity but often lack consistency in repetitive regions, resulting in fragmented assemblies. Here, we describe a novel tool to link assemblies to a genome map to aid complex genome reconstruction by detecting assembly errors and...
Run of Homozygosity a Procedure to Detecting Inbreeding in Farm Animals
Inbreeding depression is a harmful phenomenon in livestock that results from inbreeding. Inbreeding is the consequence of mating between two individuals who are more closely related to each other than the average relatedness in the population, which reduces the fitness of progeny and the genetic variability of populations. Development of high-density genome-wide single nucleotide polymorphism (SNP) array f...
Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement
Advances in modern sequencing technologies allow us to generate sufficient data to analyze hundreds of bacterial genomes from a single machine in a single day. This potential for sequencing massive numbers of genomes calls for fully automated methods to produce high-quality assemblies and variant calls. We introduce Pilon, a fully automated, all-in-one tool for correcting draft assemblies and c...